Summarize the cleaned data¶

  • Tables of descriptive statistics

    The feature summaries displayed in earlier notebooks are no longer valid due to the data-cleaning performed in those notebooks. This notebook displays feature summaries for the cleaned data.

  • Plots of distributions

    As a complement to the summary tables, the notebook also plots the distribution of values for each feature (excluding categorical features that have a large number of distinct values).

In [1]:
from pathlib import Path

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from IPython.display import SVG, display
from matplotlib.ticker import FuncFormatter, MaxNLocator, PercentFormatter
from pandas import Categorical, Series, to_datetime
from pandas.api.types import is_integer_dtype, is_numeric_dtype, is_string_dtype

from notebook_tools.data_cleaning import (
    convert_acc_loan_data,
    filter_acc_loan_data,
    load_acc_loan_data,
    load_acc_loan_feat_desc,
)
from notebook_tools.feature_exploration import (
    get_group_sizes,
    style_loan_summary,
    summarize_acc_loans,
)
from notebook_tools.plots import format_counts, get_wrapped_caption
In [2]:
loan_data = load_acc_loan_data().pipe(convert_acc_loan_data).pipe(filter_acc_loan_data)
In [3]:
feat_desc = load_acc_loan_feat_desc()

Tables of descriptive statistics¶

In [4]:
print(f"\n\nThe number of records is {len(loan_data.index):,d}.")
for dtype in [np.number, "string", "boolean"]:
    summary = summarize_acc_loans(loan_data, dtype, feat_desc)
    print(f"\n\nThe number of features of type {dtype} is {len(summary.index)}.\n\n")
    display(style_loan_summary(summary))

The number of records is 2,217,507.


The number of features of type <class 'numpy.number'> is 110.


  count mean std min 25% 50% 75% max data type description
loan_amnt 2,217,507 15,122.2 9,204.2 1,000.0 8,000.0 13,000.0 20,000.0 40,000.0 Float64 The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
funded_amnt 2,217,507 15,122.0 9,204.1 1,000.0 8,000.0 13,000.0 20,000.0 40,000.0 Float64 The total amount committed to that loan at that point in time.
funded_amnt_inv 2,217,507 15,116.5 9,201.9 725.0 8,000.0 13,000.0 20,000.0 40,000.0 Float64 The total amount committed by investors for that loan at that point in time.
term 2,217,507 42.9 10.9 36.0 36.0 36.0 60.0 60.0 Int64 The number of payments on the loan. Values are in months and can be either 36 or 60.
int_rate 2,217,507 13.1 4.8 5.3 9.5 12.6 16.0 31.0 Float64 Interest Rate on the loan
installment 2,217,507 448.2 267.6 7.6 254.0 380.7 595.5 1,719.8 Float64 The monthly payment owed by the borrower if the loan originates.
annual_inc 2,217,507 78,163.7 113,432.4 0.0 46,000.0 65,000.0 93,500.0 110,000,000.0 Float64 The self-reported annual income provided by the borrower during registration.
dti 2,215,797 18.9 14.3 -1.0 12.0 17.9 24.6 999.0 Float64 A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
delinq_2yrs 2,217,507 0.3 0.9 0.0 0.0 0.0 0.0 58.0 Int64 The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
fico_range_low 2,217,507 698.3 32.9 660.0 675.0 690.0 715.0 845.0 Int64 The lower boundary range the borrower’s FICO at loan origination belongs to.
fico_range_high 2,217,507 702.3 32.9 664.0 679.0 694.0 719.0 850.0 Int64 The upper boundary range the borrower’s FICO at loan origination belongs to.
inq_last_6mths 2,217,506 0.6 0.9 0.0 0.0 0.0 1.0 8.0 Int64 The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
mths_since_last_delinq 1,086,217 34.5 21.9 0.0 16.0 31.0 50.0 226.0 Int64 The number of months since the borrower's last delinquency.
mths_since_last_record 355,456 72.4 26.1 0.0 55.0 74.0 92.0 127.0 Int64 The number of months since the last public record.
open_acc 2,217,507 11.7 5.7 0.0 8.0 11.0 14.0 101.0 Int64 The number of open credit lines in the borrower's credit file.
pub_rec 2,217,507 0.2 0.6 0.0 0.0 0.0 0.0 86.0 Int64 Number of derogatory public records
revol_bal 2,217,507 16,703.6 22,964.0 0.0 5,995.0 11,367.0 20,300.0 2,904,836.0 Float64 Total credit revolving balance
revol_util 2,215,795 50.4 24.6 0.0 31.6 50.3 69.3 892.3 Float64 Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_acc 2,217,507 24.2 12.0 2.0 15.0 22.0 31.0 176.0 Int64 The total number of credit lines currently in the borrower's credit file
out_prncp 2,217,507 4,287.5 7,390.0 0.0 0.0 0.0 6,367.8 40,000.0 Float64 Remaining outstanding principal for total amount funded
out_prncp_inv 2,217,507 4,286.5 7,389.1 0.0 0.0 0.0 6,365.0 40,000.0 Float64 Remaining outstanding principal for portion of total amount funded by investors
total_pymnt 2,217,507 12,082.8 9,915.8 0.0 4,528.1 9,320.4 16,951.7 63,296.9 Float64 Payments received to date for total amount funded
total_pymnt_inv 2,217,507 12,077.9 9,911.8 0.0 4,526.0 9,315.9 16,943.5 63,296.9 Float64 Payments received to date for portion of total amount funded by investors
total_rec_prncp 2,217,507 9,501.8 8,343.2 0.0 3,000.0 7,000.0 13,900.2 40,000.0 Float64 Principal received to date
total_rec_int 2,217,507 2,434.8 2,681.2 0.0 729.7 1,530.0 3,113.3 28,192.5 Float64 Interest received to date
total_rec_late_fee 2,217,507 1.5 11.9 -0.0 0.0 0.0 0.0 1,484.3 Float64 Late fees received to date
recoveries 2,217,507 144.7 748.4 0.0 0.0 0.0 0.0 39,859.6 Float64 post charge off gross recovery
collection_recovery_fee 2,217,507 24.2 130.6 0.0 0.0 0.0 0.0 7,174.7 Float64 post charge off collection fee
last_pymnt_amnt 2,217,507 3,445.2 6,044.5 0.0 311.4 602.0 3,762.7 42,192.1 Float64 Last total payment amount received
last_fico_range_high 2,217,507 687.6 72.8 0.0 654.0 699.0 734.0 850.0 Int64 The upper boundary range the borrower’s last FICO pulled belongs to.
last_fico_range_low 2,217,507 675.5 110.9 0.0 650.0 695.0 730.0 845.0 Int64 The lower boundary range the borrower’s last FICO pulled belongs to.
collections_12_mths_ex_med 2,217,507 0.0 0.2 0.0 0.0 0.0 0.0 20.0 Int64 Number of collections in 12 months excluding medical collections
mths_since_last_major_derog 580,559 44.2 21.5 0.0 27.0 44.0 62.0 226.0 Int64 Months since most recent 90-day or worse rating
annual_inc_joint 120,664 123,628.2 74,170.3 5,693.5 83,400.0 110,000.0 148,000.0 7,874,821.0 Float64 The combined self-reported annual income provided by the co-borrowers during registration
dti_joint 120,660 19.3 7.8 0.0 13.5 18.8 24.6 69.5 Float64 A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income
acc_now_delinq 2,217,507 0.0 0.1 0.0 0.0 0.0 0.0 14.0 Int64 The number of accounts on which the borrower is now delinquent.
tot_coll_amt 2,189,774 232.8 8,519.6 0.0 0.0 0.0 0.0 9,152,545.0 Float64 Total collection amounts ever owed
tot_cur_bal 2,189,774 142,484.6 160,690.5 0.0 29,090.0 79,232.5 213,190.0 9,971,659.0 Float64 Total current balance of all accounts
open_acc_6m 1,394,160 0.9 1.1 0.0 0.0 1.0 1.0 18.0 Int64 Number of open trades in last 6 months
open_act_il 1,394,161 2.8 3.0 0.0 1.0 2.0 3.0 57.0 Int64 Number of currently active installment trades
open_il_12m 1,394,161 0.7 0.9 0.0 0.0 0.0 1.0 25.0 Int64 Number of installment accounts opened in past 12 months
open_il_24m 1,394,161 1.6 1.6 0.0 0.0 1.0 2.0 51.0 Int64 Number of installment accounts opened in past 24 months
mths_since_rcnt_il 1,350,371 21.2 26.1 0.0 7.0 13.0 24.0 511.0 Int64 Months since most recent installment accounts opened
total_bal_il 1,394,161 35,502.6 44,097.4 0.0 8,693.0 23,123.0 46,088.0 1,837,038.0 Float64 Total current balance of all installment accounts
il_util 1,191,472 69.1 23.7 0.0 55.0 72.0 86.0 1,000.0 Float64 Ratio of total current balance to high credit/credit limit on all install acct
open_rv_12m 1,394,161 1.3 1.5 0.0 0.0 1.0 2.0 28.0 Int64 Number of revolving trades opened in past 12 months
open_rv_24m 1,394,161 2.7 2.6 0.0 1.0 2.0 4.0 60.0 Int64 Number of revolving trades opened in past 24 months
max_bal_bc 1,394,161 5,806.5 5,690.9 0.0 2,284.0 4,413.0 7,598.0 1,170,668.0 Float64 Maximum current balance owed on all revolving accounts
all_util 1,393,942 57.0 20.9 0.0 43.0 58.0 72.0 239.0 Float64 Balance to credit limit on all trades
total_rev_hi_lim 2,189,774 34,575.0 36,730.5 0.0 14,700.0 25,400.0 43,200.0 9,999,999.0 Float64 Total revolving high credit/credit limit
inq_fi 1,394,161 1.0 1.5 0.0 0.0 1.0 1.0 48.0 Int64 Number of personal finance inquiries
total_cu_tl 1,394,160 1.5 2.7 0.0 0.0 0.0 2.0 111.0 Int64 Number of finance trades
inq_last_12m 1,394,160 2.0 2.4 0.0 0.0 1.0 3.0 67.0 Int64 Number of credit inquiries in past 12 months
acc_open_past_24mths 2,210,013 4.5 3.2 0.0 2.0 4.0 6.0 64.0 Int64 Number of trades opened in past 24 months.
avg_cur_bal 2,189,704 13,547.5 16,474.3 0.0 3,080.0 7,335.0 18,782.0 958,084.0 Float64 Average current balance of all accounts
bc_open_to_buy 2,185,120 11,395.4 16,600.7 0.0 1,722.0 5,443.0 14,189.0 711,140.0 Float64 Total open to buy on revolving bankcards.
bc_util 2,183,984 57.9 28.6 0.0 35.4 60.2 83.1 339.6 Float64 Ratio of total current balance to high credit/credit limit for all bankcard accounts.
chargeoff_within_12_mths 2,217,507 0.0 0.1 0.0 0.0 0.0 0.0 10.0 Int64 Number of charge-offs within 12 months
delinq_amnt 2,217,507 12.6 733.3 0.0 0.0 0.0 0.0 249,925.0 Float64 The past-due amount owed for the accounts on which the borrower is now delinquent.
mo_sin_old_il_acct 2,120,990 125.7 53.4 0.0 96.0 130.0 154.0 999.0 Int64 Months since oldest bank installment account opened
mo_sin_old_rev_tl_op 2,189,773 181.5 97.1 1.0 116.0 164.0 232.0 999.0 Int64 Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op 2,189,773 14.0 17.5 0.0 4.0 8.0 17.0 547.0 Int64 Months since most recent revolving account opened
mo_sin_rcnt_tl 2,189,774 8.3 9.2 0.0 3.0 6.0 11.0 382.0 Int64 Months since most recent account opened
mort_acc 2,210,013 1.6 1.9 0.0 0.0 1.0 3.0 94.0 Int64 Number of mortgage accounts.
mths_since_recent_bc 2,186,642 24.8 32.3 0.0 6.0 14.0 30.0 661.0 Int64 Months since most recent bankcard account opened.
mths_since_recent_bc_dlq 519,512 39.3 22.6 0.0 21.0 37.0 57.0 202.0 Int64 Months since most recent bankcard delinquency
mths_since_recent_inq 1,964,655 7.0 6.0 0.0 2.0 5.0 11.0 25.0 Int64 Months since most recent inquiry.
mths_since_recent_revol_delinq 740,121 35.8 22.3 0.0 17.0 33.0 51.0 202.0 Int64 Months since most recent revolving delinquency.
num_accts_ever_120_pd 2,189,774 0.5 1.4 0.0 0.0 0.0 0.0 58.0 Int64 Number of accounts ever 120 or more days past due
num_actv_bc_tl 2,189,774 3.7 2.3 0.0 2.0 3.0 5.0 50.0 Int64 Number of currently active bankcard accounts
num_actv_rev_tl 2,189,774 5.6 3.4 0.0 3.0 5.0 7.0 72.0 Int64 Number of currently active revolving trades
num_bc_sats 2,201,457 4.8 3.0 0.0 3.0 4.0 6.0 71.0 Int64 Number of satisfactory bankcard accounts
num_bc_tl 2,189,774 7.7 4.7 0.0 4.0 7.0 10.0 86.0 Int64 Number of bankcard accounts
num_il_tl 2,189,774 8.4 7.4 0.0 3.0 6.0 11.0 159.0 Int64 Number of installment accounts
num_op_rev_tl 2,189,774 8.2 4.7 0.0 5.0 7.0 10.0 91.0 Int64 Number of open revolving accounts
num_rev_accts 2,189,773 14.0 8.0 0.0 8.0 12.0 18.0 151.0 Int64 Number of revolving accounts
num_rev_tl_bal_gt_0 2,189,774 5.6 3.3 0.0 3.0 5.0 7.0 65.0 Int64 Number of revolving trades with balance >0
num_sats 2,201,457 11.6 5.6 0.0 8.0 11.0 14.0 101.0 Int64 Number of satisfactory accounts
num_tl_120dpd_2m 2,106,407 0.0 0.0 0.0 0.0 0.0 0.0 7.0 Int64 Number of accounts currently 120 days past due (updated in past 2 months)
num_tl_30dpd 2,189,774 0.0 0.1 0.0 0.0 0.0 0.0 4.0 Int64 Number of accounts currently 30 days past due (updated in past 2 months)
num_tl_90g_dpd_24m 2,189,774 0.1 0.5 0.0 0.0 0.0 0.0 58.0 Int64 Number of accounts 90 or more days past due in last 24 months
num_tl_op_past_12m 2,189,774 2.1 1.8 0.0 1.0 2.0 3.0 32.0 Int64 Number of accounts opened in past 12 months
pct_tl_nvr_dlq 2,189,619 94.1 9.0 0.0 91.3 100.0 100.0 100.0 Float64 Percent of trades never delinquent
percent_bc_gt_75 2,184,676 42.4 36.2 0.0 0.0 37.5 71.4 100.0 Float64 Percentage of all bankcard accounts > 75% of limit.
pub_rec_bankruptcies 2,217,507 0.1 0.4 0.0 0.0 0.0 0.0 12.0 Int64 Number of public record bankruptcies
tax_liens 2,217,507 0.0 0.4 0.0 0.0 0.0 0.0 85.0 Int64 Number of tax liens
tot_hi_cred_lim 2,189,774 178,236.7 181,574.2 0.0 50,729.0 114,292.0 257,743.0 9,999,999.0 Float64 Total high credit/credit limit
total_bal_ex_mort 2,210,013 51,019.2 49,911.3 0.0 20,890.0 37,861.0 64,343.0 3,408,095.0 Float64 Total credit balance excluding mortgage
total_bc_limit 2,210,013 23,195.1 23,007.8 0.0 8,300.0 16,300.0 30,300.0 1,569,000.0 Float64 Total bankcard high credit/credit limit
total_il_high_credit_limit 2,189,774 43,728.1 45,072.3 0.0 15,000.0 32,692.0 58,798.0 2,118,996.0 Float64 Total installment high credit/credit limit
revol_bal_joint 107,982 33,616.7 28,154.7 0.0 15,107.0 26,515.0 43,765.8 1,110,019.0 Float64 Sum of revolving credit balance of the co-borrowers, net of duplicate balances
sec_app_fico_range_low 107,983 669.8 44.7 540.0 645.0 670.0 695.0 845.0 Int64 FICO range (high) for the secondary applicant
sec_app_fico_range_high 107,983 673.8 44.7 544.0 649.0 674.0 699.0 850.0 Int64 FICO range (low) for the secondary applicant
sec_app_inq_last_6mths 107,983 0.6 1.0 0.0 0.0 0.0 1.0 6.0 Int64 Credit inquiries in the last 6 months at time of application for the secondary applicant
sec_app_mort_acc 107,983 1.5 1.8 0.0 0.0 1.0 2.0 27.0 Int64 Number of mortgage accounts at time of application for the secondary applicant
sec_app_open_acc 107,983 11.5 6.6 0.0 7.0 10.0 15.0 82.0 Int64 Number of open trades at time of application for the secondary applicant
sec_app_revol_util 106,147 58.2 25.5 0.0 39.8 60.2 78.6 434.3 Float64 Ratio of total current balance to high credit/credit limit for all revolving accounts
sec_app_open_act_il 107,983 3.0 3.3 0.0 1.0 2.0 4.0 43.0 Int64 Number of currently active installment trades at time of application for the secondary applicant
sec_app_num_rev_accts 107,983 12.5 8.2 0.0 7.0 11.0 17.0 106.0 Int64 Number of revolving accounts at time of application for the secondary applicant
sec_app_chargeoff_within_12_mths 107,983 0.0 0.4 0.0 0.0 0.0 0.0 21.0 Int64 Number of charge-offs within last 12 months at time of application for the secondary applicant
sec_app_collections_12_mths_ex_med 107,983 0.1 0.4 0.0 0.0 0.0 0.0 23.0 Int64 Number of collections within last 12 months excluding medical collections at time of application for the secondary applicant
sec_app_mths_since_last_major_derog 35,925 36.9 23.9 0.0 16.0 36.0 56.0 185.0 Int64 Months since most recent 90-day or worse rating at time of application for the secondary applicant
hardship_amount 10,913 155.0 129.0 0.6 59.4 119.1 213.2 943.9 Float64 The interest payment that the borrower has committed to make each month while they are on a hardship plan
hardship_dpd 10,913 13.7 9.7 0.0 5.0 15.0 22.0 37.0 Int64 Account days past due as of the hardship plan start date
orig_projected_additional_accrued_interest 8,648 454.7 375.3 1.9 175.2 352.7 620.1 2,680.9 Float64 The original projected additional interest amount that will accrue for the given hardship payment plan as of the Hardship Start Date. This field will be null if the borrower has broken their hardship payment plan.
hardship_payoff_balance_amount 10,913 11,634.9 7,624.4 55.7 5,627.0 10,028.4 16,143.4 40,306.4 Float64 The payoff balance amount as of the hardship plan start date
hardship_last_payment_amount 10,913 194.0 198.6 0.0 44.4 133.2 284.2 1,407.9 Float64 The last payment amount as of the hardship plan start date
settlement_amount 34,083 5,014.2 3,695.4 44.2 2,209.0 4,148.0 6,856.0 33,601.0 Float64 The loan amount that the borrower has agreed to settle for
settlement_percentage 34,083 47.8 7.2 0.2 45.0 45.0 50.0 521.4 Float64 The settlement amount as a percentage of the payoff balance amount on the loan
settlement_term 34,083 13.2 8.1 0.0 6.0 14.0 18.0 181.0 Int64 The number of months that the borrower will be on the settlement plan

The number of features of type string is 30.


  count unique top freq data type description
id 2,217,507 2,217,507 68407277 1 string A unique LC assigned ID for the loan listing.
grade 2,217,507 7 B 651,022 string LC assigned loan grade
sub_grade 2,217,507 35 C1 143,589 string LC assigned loan subgrade
emp_title 2,053,170 489,609 Teacher 38,818 string The job title supplied by the Borrower when applying for the loan.*
emp_length 2,071,717 11 10+ years 738,438 string Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership 2,217,507 6 MORTGAGE 1,092,198 string The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
verification_status 2,217,507 3 Source Verified 875,634 string Indicates if income was verified by LC, not verified, or if the income source was verified
issue_d 2,217,507 84 2016-03 61,969 string The month which the loan was funded
loan_status 2,217,507 7 Fully Paid 1,042,311 string Current status of the loan
purpose 2,217,507 14 debt_consolidation 1,257,717 string A category provided by the borrower for the loan request.
zip_code 2,217,506 954 112xx 23,362 string The first 3 numbers of the zip code provided by the borrower in the loan application.
addr_state 2,217,507 51 CA 307,042 string The state provided by the borrower in the loan application
earliest_cr_line 2,217,507 751 2004-09 15,172 string The month the borrower's earliest reported credit line was opened
initial_list_status 2,217,507 2 w 1,535,014 string The initial listing status of the loan. Possible values are – W, F
last_pymnt_d 2,215,163 86 2019-03 852,741 string Last month payment was received
next_pymnt_d 912,333 8 2019-04 911,945 string Next scheduled payment date
last_credit_pull_d 2,217,439 88 2019-03 1,362,420 string The most recent month LC pulled credit for this loan
application_type 2,217,507 2 Individual 2,096,843 string Indicates whether the loan is an individual application or a joint application with two co-borrowers
verification_status_joint 115,685 3 Not Verified 57,377 string Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified
sec_app_earliest_cr_line 107,983 663 2006-08 998 string Earliest credit line at time of application for the secondary applicant
hardship_reason 10,913 9 NATURAL_DISASTER 2,965 string Describes the reason the hardship plan was offered
hardship_status 10,913 3 COMPLETED 7,816 string Describes if the hardship plan is active, pending, canceled, completed, or broken
hardship_start_date 10,913 27 2017-09 2,443 string The start date of the hardship plan period
hardship_end_date 10,913 28 2017-12 1,755 string The end date of the hardship plan period
payment_plan_start_date 10,913 27 2017-09 1,714 string The day the first hardship plan payment is due. For example, if a borrower has a hardship plan period of 3 months, the start date is the start of the three-month period in which the borrower is allowed to make interest-only payments.
hardship_loan_status 10,913 5 Late (16-30 days) 4,767 string Loan Status as of the hardship plan start date
disbursement_method 2,217,507 2 Cash 2,139,393 string The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY
debt_settlement_flag_date 34,083 70 2019-02 2,605 string The most recent date that the Debt_Settlement_Flag has been set  
settlement_status 34,083 3 ACTIVE 14,701 string The status of the borrower’s settlement plan. Possible values are: COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT
settlement_date 34,083 75 2019-01 1,710 string The date that the borrower agrees to the settlement plan

The number of features of type boolean is 3.


  count unique top freq data type description
pymnt_plan 2,217,507 2 False 2,216,887 boolean Indicates if a payment plan has been put in place for the loan
hardship_flag 2,217,507 2 False 2,216,675 boolean Flags whether or not the borrower is on a hardship plan
debt_settlement_flag 2,217,507 2 False 2,183,424 boolean Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company.

Plots of distributions¶

When working in R markdown, I have often used a loop to generate plots showing the distributions of all features. The plots function as a complement to summary tables such as those shown in the previous section of the notebook.

When I tried to use this workflow while plotting with seaborn in Jupyter, I found that memory leakage caused the kernel to die, even when I took care to ensure that all figures where closed at the end of each loop iteration. For a few years there have been posts about memory leakage associated with using matplotlib to generate figures in a loop within Jupyter, and it appears that the only reliable solution is to use a noninteractive backend for matplotlib.

To apply this workaround, each iteration of a loop that generates plots does the following:

  • Use seaborn to generate a figure.
  • Write the figure to disk in SVG format.
  • Load the SVG figure from disk as a string.
  • Pass the SVG string to functions from IPython.display to add the figure to the notebook.

These steps are encapsulated in the function display_svg_plot defined below.

In [5]:
matplotlib.use("svg")
sns.set_theme()
In [6]:
def display_svg_plot(plot):
    temp_svg_path = Path.cwd() / "temp.svg"
    plot.savefig(temp_svg_path, format="svg")
    svg_string = temp_svg_path.read_text()
    display(SVG(data=svg_string))
    plt.close("all")
    temp_svg_path.unlink()

Certain columns contain categorical data with a large number of possible values. Exclude these columns from the loop that generates plots of distribution.

In [7]:
do_not_plot = ["id", "emp_title", "zip_code"]

Define which type of plot to generate for each feature.

In [8]:
line_plots = [
    "issue_d",
    "earliest_cr_line",
    "last_pymnt_d",
    "next_pymnt_d",
    "last_credit_pull_d",
    "sec_app_earliest_cr_line",
    "hardship_start_date",
    "hardship_end_date",
    "payment_plan_start_date",
    "debt_settlement_flag_date",
    "settlement_date",
]

boolean_plots = ["pymnt_plan", "hardship_flag", "debt_settlement_flag"]

logscale_histograms = [
    "installment",
    "annual_inc",
    "dti",
    "revol_bal",
    "revol_util",
    "out_prncp",
    "out_prncp_inv",
    "total_pymnt",
    "total_pymnt_inv",
    "total_rec_int",
    "last_pymnt_amnt",
    "annual_inc_joint",
    "tot_cur_bal",
    "mths_since_rcnt_il",
    "total_bal_il",
    "il_util",
    "max_bal_bc",
    "total_rev_hi_lim",
    "avg_cur_bal",
    "bc_open_to_buy",
    "bc_util",
    "mo_sin_old_il_acct",
    "mo_sin_old_rev_tl_op",
    "mo_sin_rcnt_rev_tl_op",
    "mo_sin_rcnt_tl",
    "mths_since_recent_bc",
    "tot_hi_cred_lim",
    "total_bal_ex_mort",
    "total_bc_limit",
    "total_il_high_credit_limit",
    "revol_bal_joint",
    "settlement_percentage",
]
In [9]:
linear_histograms = [
    column
    for column in loan_data
    if column not in (do_not_plot + line_plots + boolean_plots + logscale_histograms)
]

Define variables used in plot customization.

In [10]:
custom_binwidth = {
    "fico_range_low": 5,
    "fico_range_high": 5,
    "last_fico_range_low": 5,
    "last_fico_range_high": 5,
    "all_util": 2,
    "sec_app_fico_range_low": 5,
    "sec_app_fico_range_high": 5,
    "pct_tl_nvr_dlq": 1,
}

percent_formatting = [
    "pct_tl_nvr_dlq",
    "percent_bc_gt_75",
    "settlement_percentage",
]

angled_labels = [
    "purpose",
    "hardship_reason",
]

custom_ordered_categories = {
    "emp_length": [
        "< 1 year",
        "1 year",
        "2 years",
        "3 years",
        "4 years",
        "5 years",
        "6 years",
        "7 years",
        "8 years",
        "9 years",
        "10+ years",
    ]
}
In [11]:
def plot_distribution(data, column_name, title, caption, **kwargs):
    column_data = data[column_name]

    # If the values are strings, then set the order in which labels will appear along
    # the x-axis.
    if is_string_dtype(column_data.dtype):
        column_data = _convert_to_ordered_categorical(column_data, column_name)

    plot = sns.displot(x=column_data, **kwargs).set(title=title)

    # Tweak tick locations and labels.
    ax = plot.facet_axis(0, 0)
    ax.yaxis.set_major_formatter(FuncFormatter(format_counts))
    if column in percent_formatting:
        ax.xaxis.set_major_formatter(PercentFormatter(decimals=0))
    elif is_numeric_dtype(column_data.dtype) and "log_scale" not in kwargs:
        # Force tick marks to be located on integers.  The call below is equivalent to
        # using matplotlib.ticker.AutoLocator with the constraint that tick marks are
        # integers.
        ax.xaxis.set_major_locator(
            MaxNLocator(nbins="auto", steps=[1, 2, 2.5, 5, 10], integer=True)
        )
        # Add comma separators in formatting tick labels for the x-axis.
        ax.xaxis.set_major_formatter(
            FuncFormatter(
                lambda tick_value, _tick_position: format(int(tick_value), ",")
            )
        )
    if column in angled_labels:
        # The steps to rotate tick labels for the x-axis using matplotlib's
        # object-oriented approach are clumsy, so use pyplot instead.
        plt.xticks(rotation=-45, ha="left")

    display_svg_plot(plot)
    print(f"\n{caption}\n\n\n", flush=True)


def _convert_to_ordered_categorical(col_data, col_name):
    if col_name in custom_ordered_categories:
        ordered_categories = custom_ordered_categories[col_name]
    else:
        ordered_categories = Series(col_data.unique()).dropna().sort_values()
    categorical_data = Categorical(
        col_data, categories=ordered_categories, ordered=True
    )
    categorical_data.name = col_data.name
    return categorical_data
In [12]:
def plot_date_distribution(data, column_name, title, caption, **kwargs):
    to_plot = get_group_sizes(data, group_by=column_name).rename(
        {"count": "Count"}, axis="columns"
    )
    to_plot[column_name] = to_datetime(to_plot[column_name], format="ISO8601")

    kwargs["kind"] = "line"
    if len(to_plot[column_name].index) <= 150:
        kwargs["marker"] = "o"
    plot = sns.relplot(data=to_plot, x=column_name, y="Count", **kwargs).set(
        title=title
    )

    # Tweak formatting of tick labels for the y-axis..
    ax = plot.facet_axis(0, 0)
    ax.yaxis.set_major_formatter(FuncFormatter(format_counts))

    display_svg_plot(plot)
    print(f"\n{caption}\n\n\n", flush=True)
In [13]:
for column in linear_histograms:
    caption = get_wrapped_caption(column, feat_desc, width=200)
    kwargs = {"aspect": 3.3}
    if column in custom_binwidth:
        kwargs["binwidth"] = custom_binwidth[column]
    elif is_integer_dtype(loan_data[column].dtype):
        kwargs["discrete"] = True
    plot_distribution(loan_data, column, f"Distribution of {column}", caption, **kwargs)
No description has been provided for this image
loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.



No description has been provided for this image
funded_amnt: The total amount committed to that loan at that point in time.



No description has been provided for this image
funded_amnt_inv: The total amount committed by investors for that loan at that point in time.



No description has been provided for this image
term: The number of payments on the loan. Values are in months and can be either 36 or 60.



No description has been provided for this image
int_rate: Interest Rate on the loan



No description has been provided for this image
grade: LC assigned loan grade



No description has been provided for this image
sub_grade: LC assigned loan subgrade



No description has been provided for this image
emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.



No description has been provided for this image
home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER



No description has been provided for this image
verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified



No description has been provided for this image
loan_status: Current status of the loan



No description has been provided for this image
purpose: A category provided by the borrower for the loan request.



No description has been provided for this image
addr_state: The state provided by the borrower in the loan application



No description has been provided for this image
delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years



No description has been provided for this image
fico_range_low: The lower boundary range the borrower’s FICO at loan origination belongs to.



No description has been provided for this image
fico_range_high: The upper boundary range the borrower’s FICO at loan origination belongs to.



No description has been provided for this image
inq_last_6mths: The number of inquiries in past 6 months (excluding auto and mortgage inquiries)



No description has been provided for this image
mths_since_last_delinq: The number of months since the borrower's last delinquency.



No description has been provided for this image
mths_since_last_record: The number of months since the last public record.



No description has been provided for this image
open_acc: The number of open credit lines in the borrower's credit file.



No description has been provided for this image
pub_rec: Number of derogatory public records



No description has been provided for this image
total_acc: The total number of credit lines currently in the borrower's credit file



No description has been provided for this image
initial_list_status: The initial listing status of the loan. Possible values are – W, F



No description has been provided for this image
total_rec_prncp: Principal received to date



No description has been provided for this image
total_rec_late_fee: Late fees received to date



No description has been provided for this image
recoveries: post charge off gross recovery



No description has been provided for this image
collection_recovery_fee: post charge off collection fee



No description has been provided for this image
last_fico_range_high: The upper boundary range the borrower’s last FICO pulled belongs to.



No description has been provided for this image
last_fico_range_low: The lower boundary range the borrower’s last FICO pulled belongs to.



No description has been provided for this image
collections_12_mths_ex_med: Number of collections in 12 months excluding medical collections



No description has been provided for this image
mths_since_last_major_derog: Months since most recent 90-day or worse rating



No description has been provided for this image
application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers



No description has been provided for this image
dti_joint: A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-
reported monthly income



No description has been provided for this image
verification_status_joint: Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified



No description has been provided for this image
acc_now_delinq: The number of accounts on which the borrower is now delinquent.



No description has been provided for this image
tot_coll_amt: Total collection amounts ever owed



No description has been provided for this image
open_acc_6m: Number of open trades in last 6 months



No description has been provided for this image
open_act_il: Number of currently active installment trades



No description has been provided for this image
open_il_12m: Number of installment accounts opened in past 12 months



No description has been provided for this image
open_il_24m: Number of installment accounts opened in past 24 months



No description has been provided for this image
open_rv_12m: Number of revolving trades opened in past 12 months



No description has been provided for this image
open_rv_24m: Number of revolving trades opened in past 24 months



No description has been provided for this image
all_util: Balance to credit limit on all trades



No description has been provided for this image
inq_fi: Number of personal finance inquiries



No description has been provided for this image
total_cu_tl: Number of finance trades



No description has been provided for this image
inq_last_12m: Number of credit inquiries in past 12 months



No description has been provided for this image
acc_open_past_24mths: Number of trades opened in past 24 months.



No description has been provided for this image
chargeoff_within_12_mths: Number of charge-offs within 12 months



No description has been provided for this image
delinq_amnt: The past-due amount owed for the accounts on which the borrower is now delinquent.



No description has been provided for this image
mort_acc: Number of mortgage accounts.



No description has been provided for this image
mths_since_recent_bc_dlq: Months since most recent bankcard delinquency



No description has been provided for this image
mths_since_recent_inq: Months since most recent inquiry.



No description has been provided for this image
mths_since_recent_revol_delinq: Months since most recent revolving delinquency.



No description has been provided for this image
num_accts_ever_120_pd: Number of accounts ever 120 or more days past due



No description has been provided for this image
num_actv_bc_tl: Number of currently active bankcard accounts



No description has been provided for this image
num_actv_rev_tl: Number of currently active revolving trades



No description has been provided for this image
num_bc_sats: Number of satisfactory bankcard accounts



No description has been provided for this image
num_bc_tl: Number of bankcard accounts



No description has been provided for this image
num_il_tl: Number of installment accounts



No description has been provided for this image
num_op_rev_tl: Number of open revolving accounts



No description has been provided for this image
num_rev_accts: Number of revolving accounts



No description has been provided for this image
num_rev_tl_bal_gt_0: Number of revolving trades with balance >0



No description has been provided for this image
num_sats: Number of satisfactory accounts



No description has been provided for this image
num_tl_120dpd_2m: Number of accounts currently 120 days past due (updated in past 2 months)



No description has been provided for this image
num_tl_30dpd: Number of accounts currently 30 days past due (updated in past 2 months)



No description has been provided for this image
num_tl_90g_dpd_24m: Number of accounts 90 or more days past due in last 24 months



No description has been provided for this image
num_tl_op_past_12m: Number of accounts opened in past 12 months



No description has been provided for this image
pct_tl_nvr_dlq: Percent of trades never delinquent



No description has been provided for this image
percent_bc_gt_75: Percentage of all bankcard accounts > 75% of limit.



No description has been provided for this image
pub_rec_bankruptcies: Number of public record bankruptcies



No description has been provided for this image
tax_liens: Number of tax liens



No description has been provided for this image
sec_app_fico_range_low:  FICO range (high) for the secondary applicant



No description has been provided for this image
sec_app_fico_range_high:  FICO range (low) for the secondary applicant



No description has been provided for this image
sec_app_inq_last_6mths:  Credit inquiries in the last 6 months at time of application for the secondary applicant



No description has been provided for this image
sec_app_mort_acc:  Number of mortgage accounts at time of application for the secondary applicant



No description has been provided for this image
sec_app_open_acc:  Number of open trades at time of application for the secondary applicant



No description has been provided for this image
sec_app_revol_util:  Ratio of total current balance to high credit/credit limit for all revolving accounts



No description has been provided for this image
sec_app_open_act_il:  Number of currently active installment trades at time of application for the secondary applicant



No description has been provided for this image
sec_app_num_rev_accts:  Number of revolving accounts at time of application for the secondary applicant



No description has been provided for this image
sec_app_chargeoff_within_12_mths:  Number of charge-offs within last 12 months at time of application for the secondary applicant



No description has been provided for this image
sec_app_collections_12_mths_ex_med:  Number of collections within last 12 months excluding medical collections at time of application for the secondary applicant



No description has been provided for this image
sec_app_mths_since_last_major_derog:  Months since most recent 90-day or worse rating at time of application for the secondary applicant



No description has been provided for this image
hardship_reason: Describes the reason the hardship plan was offered



No description has been provided for this image
hardship_status: Describes if the hardship plan is active, pending, canceled, completed, or broken



No description has been provided for this image
hardship_amount: The interest payment that the borrower has committed to make each month while they are on a hardship plan



No description has been provided for this image
hardship_dpd: Account days past due as of the hardship plan start date



No description has been provided for this image
hardship_loan_status: Loan Status as of the hardship plan start date



No description has been provided for this image
orig_projected_additional_accrued_interest: The original projected additional interest amount that will accrue for the given hardship payment plan as of the Hardship Start Date. This field will be
null if the borrower has broken their hardship payment plan.



No description has been provided for this image
hardship_payoff_balance_amount: The payoff balance amount as of the hardship plan start date



No description has been provided for this image
hardship_last_payment_amount: The last payment amount as of the hardship plan start date



No description has been provided for this image
disbursement_method: The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY



No description has been provided for this image
settlement_status: The status of the borrower’s settlement plan. Possible values are: COMPLETE, ACTIVE, BROKEN, CANCELLED, DENIED, DRAFT



No description has been provided for this image
settlement_amount: The loan amount that the borrower has agreed to settle for



No description has been provided for this image
settlement_term: The number of months that the borrower will be on the settlement plan



In [14]:
for column in logscale_histograms:
    caption = get_wrapped_caption(column, feat_desc, width=200)
    kwargs = {"aspect": 3.3, "log_scale": True, "bins": 50}
    plot_distribution(loan_data, column, f"Distribution of {column}", caption, **kwargs)
No description has been provided for this image
installment: The monthly payment owed by the borrower if the loan originates.



No description has been provided for this image
annual_inc: The self-reported annual income provided by the borrower during registration.



No description has been provided for this image
dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly
income.



No description has been provided for this image
revol_bal: Total credit revolving balance



No description has been provided for this image
revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.



No description has been provided for this image
out_prncp: Remaining outstanding principal for total amount funded



No description has been provided for this image
out_prncp_inv: Remaining outstanding principal for portion of total amount funded by investors



No description has been provided for this image
total_pymnt: Payments received to date for total amount funded



No description has been provided for this image
total_pymnt_inv: Payments received to date for portion of total amount funded by investors



No description has been provided for this image
total_rec_int: Interest received to date



No description has been provided for this image
last_pymnt_amnt: Last total payment amount received



No description has been provided for this image
annual_inc_joint: The combined self-reported annual income provided by the co-borrowers during registration



No description has been provided for this image
tot_cur_bal: Total current balance of all accounts



No description has been provided for this image
mths_since_rcnt_il: Months since most recent installment accounts opened



No description has been provided for this image
total_bal_il: Total current balance of all installment accounts



No description has been provided for this image
il_util: Ratio of total current balance to high credit/credit limit on all install acct



No description has been provided for this image
max_bal_bc: Maximum current balance owed on all revolving accounts



No description has been provided for this image
total_rev_hi_lim: Total revolving high credit/credit limit



No description has been provided for this image
avg_cur_bal: Average current balance of all accounts



No description has been provided for this image
bc_open_to_buy: Total open to buy on revolving bankcards.



No description has been provided for this image
bc_util: Ratio of total current balance to high credit/credit limit for all bankcard accounts.



No description has been provided for this image
mo_sin_old_il_acct: Months since oldest bank installment account opened



No description has been provided for this image
mo_sin_old_rev_tl_op: Months since oldest revolving account opened



No description has been provided for this image
mo_sin_rcnt_rev_tl_op: Months since most recent revolving account opened



No description has been provided for this image
mo_sin_rcnt_tl: Months since most recent account opened



No description has been provided for this image
mths_since_recent_bc: Months since most recent bankcard account opened.



No description has been provided for this image
tot_hi_cred_lim: Total high credit/credit limit



No description has been provided for this image
total_bal_ex_mort: Total credit balance excluding mortgage



No description has been provided for this image
total_bc_limit: Total bankcard high credit/credit limit



No description has been provided for this image
total_il_high_credit_limit: Total installment high credit/credit limit



No description has been provided for this image
revol_bal_joint:  Sum of revolving credit balance of the co-borrowers, net of duplicate balances



No description has been provided for this image
settlement_percentage: The settlement amount as a percentage of the payoff balance amount on the loan



In [15]:
for column in boolean_plots:
    caption = get_wrapped_caption(column, feat_desc, width=200)
    kwargs = {"aspect": 3.3}
    mapper = {True: "True", False: "False"}
    plot_distribution(
        loan_data[column].map(mapper).to_frame(),
        column,
        f"Distribution of {column}",
        caption,
        **kwargs,
    )
No description has been provided for this image
pymnt_plan: Indicates if a payment plan has been put in place for the loan



No description has been provided for this image
hardship_flag: Flags whether or not the borrower is on a hardship plan



No description has been provided for this image
debt_settlement_flag: Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company.



In [16]:
for column in line_plots:
    caption = get_wrapped_caption(column, feat_desc, width=200)
    kwargs = {"aspect": 3.3}
    plot_date_distribution(
        loan_data, column, f"Distribution of {column}", caption, **kwargs
    )
No description has been provided for this image
issue_d: The month which the loan was funded



No description has been provided for this image
earliest_cr_line: The month the borrower's earliest reported credit line was opened



No description has been provided for this image
last_pymnt_d: Last month payment was received



No description has been provided for this image
next_pymnt_d: Next scheduled payment date



No description has been provided for this image
last_credit_pull_d: The most recent month LC pulled credit for this loan



No description has been provided for this image
sec_app_earliest_cr_line:  Earliest credit line at time of application for the secondary applicant



No description has been provided for this image
hardship_start_date: The start date of the hardship plan period



No description has been provided for this image
hardship_end_date: The end date of the hardship plan period



No description has been provided for this image
payment_plan_start_date: The day the first hardship plan payment is due. For example, if a borrower has a hardship plan period of 3 months, the start date is the start of the three-month period in
which the borrower is allowed to make interest-only payments.



No description has been provided for this image
debt_settlement_flag_date: The most recent date that the Debt_Settlement_Flag has been set  



No description has been provided for this image
settlement_date: The date that the borrower agrees to the settlement plan